University of San Francisco
We’ve just started to talk about relationships between multiple variables.
Why might we care a lot about relationships between multiple variables?
Identifying multivariate relationships helps us as marketers.
In this module, we focus on:
I’ve taken real data from rei.com and added a few simulated variables for class.
Download it from Canvas, and load it into R!
The dataset has these variables: price, avg_rating, sales, mens, avg_size, avg_running, cushion, and weight.
Which of these is our outcome? sales
With continuous predictors (independent variables), we are asking whether some numeric variable predicts our outcome.
Is there a meaningful relationship between these two variables?
The mean is our best guess (model) when we know nothing else; we use lm() to find better (linear) models.
In our REI data, our outcome is sales, which averages 10.05 (SD 0.92).
Can we use weight to predict sales better? In other words, does weight predict sales?
Is this scatterplot showing a strong relationship?
Are these scatterplots showing strong relationships?
Two ways to know with numbers: lm() in R, and cor() in R.
So what is linear regression? Regression does the following:
reiData$sales
To understand this, let’s look at the first 20 rows of reiData, keeping only the columns useful for this example:
weight sales
1 3.160536 10.069211
2 4.970014 10.131396
3 6.759033 8.856539
4 3.068526 10.783086
5 5.254373 11.149486
6 4.769999 10.239619
7 3.831546 10.078726
8 3.805244 10.128758
9 3.172416 10.696915
10 2.901195 9.825507
11 2.461316 10.569462
12 2.737338 10.440997
13 2.982250 9.584194
14 5.910042 11.216148
15 5.154571 9.845226
16 6.090791 9.356569
17 5.153109 10.137007
18 4.059393 9.513671
19 5.565751 9.480360
20 2.473582 11.151987
If we are interested in the sales column, let’s see what it looks like:
I’ll start by guessing the mean, or average
weight sales guess
1 3.160536 10.069211 10.16274
2 4.970014 10.131396 10.16274
3 6.759033 8.856539 10.16274
4 3.068526 10.783086 10.16274
5 5.254373 11.149486 10.16274
6 4.769999 10.239619 10.16274
7 3.831546 10.078726 10.16274
8 3.805244 10.128758 10.16274
9 3.172416 10.696915 10.16274
10 2.901195 9.825507 10.16274
11 2.461316 10.569462 10.16274
12 2.737338 10.440997 10.16274
13 2.982250 9.584194 10.16274
14 5.910042 11.216148 10.16274
15 5.154571 9.845226 10.16274
16 6.090791 9.356569 10.16274
17 5.153109 10.137007 10.16274
18 4.059393 9.513671 10.16274
19 5.565751 9.480360 10.16274
20 2.473582 11.151987 10.16274
How wrong was I?
weight sales guess how_wrong how_wrong2
1 3.160536 10.069211 10.16274 0.09353188 0.0087482124
2 4.970014 10.131396 10.16274 0.03134738 0.0009826582
3 6.759033 8.856539 10.16274 1.30620416 1.7061693156
4 3.068526 10.783086 10.16274 -0.62034322 0.3848257155
5 5.254373 11.149486 10.16274 -0.98674322 0.9736621774
6 4.769999 10.239619 10.16274 -0.07687579 0.0059098874
7 3.831546 10.078726 10.16274 0.08401714 0.0070588794
8 3.805244 10.128758 10.16274 0.03398506 0.0011549846
9 3.172416 10.696915 10.16274 -0.53417160 0.2853392931
10 2.901195 9.825507 10.16274 0.33723672 0.1137286034
11 2.461316 10.569462 10.16274 -0.40671836 0.1654198253
12 2.737338 10.440997 10.16274 -0.27825397 0.0774252726
13 2.982250 9.584194 10.16274 0.57854943 0.3347194427
14 5.910042 11.216148 10.16274 -1.05340448 1.1096609896
15 5.154571 9.845226 10.16274 0.31751697 0.1008170242
16 6.090791 9.356569 10.16274 0.80617426 0.6499169379
17 5.153109 10.137007 10.16274 0.02573597 0.0006623402
18 4.059393 9.513671 10.16274 0.64907221 0.4212947353
19 5.565751 9.480360 10.16274 0.68238311 0.4656467078
20 2.473582 11.151987 10.16274 -0.98924365 0.9786030031
The column how_wrong2 is equal to (guess - sales)^2; i.e., the guess, minus the real data, squared.
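That squared-error computation is easy to do directly in R. A minimal sketch, using a short stand-in vector instead of the real Canvas data:

```r
# A sketch of the squared-error calculation behind the table above.
# reiDataSmall here is a tiny stand-in; the real data comes from Canvas.
reiDataSmall <- data.frame(sales = c(10.07, 10.13, 8.86, 10.78, 11.15))

guess      <- mean(reiDataSmall$sales)     # best guess with no other information
how_wrong  <- guess - reiDataSmall$sales   # the guess, minus the real data...
how_wrong2 <- how_wrong^2                  # ...squared
sum(how_wrong2)                            # total squared error of this guess
```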
To see if that’s the best guess we could have made:
I’ll guess something else
How wrong was I?
weight sales guess how_wrong how_wrong2
1 3.160536 10.069211 11 0.9307886 0.86636747
2 4.970014 10.131396 11 0.8686041 0.75447313
3 6.759033 8.856539 11 2.1434609 4.59442468
4 3.068526 10.783086 11 0.2169135 0.04705148
5 5.254373 11.149486 11 -0.1494865 0.02234620
6 4.769999 10.239619 11 0.7603810 0.57817920
7 3.831546 10.078726 11 0.9212739 0.84874557
8 3.805244 10.128758 11 0.8712418 0.75906230
9 3.172416 10.696915 11 0.3030852 0.09186061
10 2.901195 9.825507 11 1.1744935 1.37943490
11 2.461316 10.569462 11 0.4305384 0.18536330
12 2.737338 10.440997 11 0.5590028 0.31248410
13 2.982250 9.584194 11 1.4158062 2.00450713
14 5.910042 11.216148 11 -0.2161477 0.04671984
15 5.154571 9.845226 11 1.1547737 1.33350233
16 6.090791 9.356569 11 1.6434310 2.70086548
17 5.153109 10.137007 11 0.8629927 0.74475644
18 4.059393 9.513671 11 1.4863290 2.20917378
19 5.565751 9.480360 11 1.5196399 2.30930530
20 2.473582 11.151987 11 -0.1519869 0.02310002
I’ll guess something else
How wrong was I?
weight sales guess how_wrong how_wrong2
1 3.160536 10.069211 9 -1.0692114 1.14321296
2 4.970014 10.131396 9 -1.1313959 1.28005662
3 6.759033 8.856539 9 0.1434609 0.02058103
4 3.068526 10.783086 9 -1.7830865 3.17939738
5 5.254373 11.149486 9 -2.1494865 4.62029208
6 4.769999 10.239619 9 -1.2396190 1.53665537
7 3.831546 10.078726 9 -1.0787261 1.16365003
8 3.805244 10.128758 9 -1.1287582 1.27409505
9 3.172416 10.696915 9 -1.6969148 2.87952000
10 2.901195 9.825507 9 -0.8255065 0.68146104
11 2.461316 10.569462 9 -1.5694616 2.46320975
12 2.737338 10.440997 9 -1.4409972 2.07647300
13 2.982250 9.584194 9 -0.5841938 0.34128242
14 5.910042 11.216148 9 -2.2161477 4.91131075
15 5.154571 9.845226 9 -0.8452263 0.71440747
16 6.090791 9.356569 9 -0.3565690 0.12714145
17 5.153109 10.137007 9 -1.1370073 1.29278555
18 4.059393 9.513671 9 -0.5136710 0.26385794
19 5.565751 9.480360 9 -0.4803601 0.23074587
20 2.473582 11.151987 9 -2.1519869 4.63104763
To see if that’s the best guess we could have made, I’ll try many guesses, from the minimum of sales to the maximum, using a for() loop. Then I’ll find the guess that was least wrong:
bestguess <- min(simulation.results$how_wrong2)
simulation.results[simulation.results$how_wrong2 == bestguess,]
guess how_wrong2
1 10.16274 7.791746
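The simulation.results frame above could have been built like this: a sketch mirroring the median version used next, where sims, the random seed, and the stand-in data are assumptions of mine:

```r
# Sketch: score many guesses by summed squared error; the mean should win.
set.seed(1)
reiDataSmall <- data.frame(sales = rnorm(20, mean = 10, sd = 1))  # stand-in data
sims <- 1000

simulation.results <- data.frame(
  guess = c(mean(reiDataSmall$sales),           # try the mean first...
            seq(length.out = sims - 1,          # ...then a grid of guesses
                from = min(reiDataSmall$sales),
                to   = max(reiDataSmall$sales))),
  how_wrong2 = rep(NA, times = sims)
)

for (i in 1:sims) {
  simulation.results$how_wrong2[i] <-
    sum((simulation.results$guess[i] - reiDataSmall$sales)^2)
}

bestguess <- min(simulation.results$how_wrong2)
simulation.results[simulation.results$how_wrong2 == bestguess, ]  # the mean wins
```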
simulation.results.median <- data.frame(
  guess = c(median(reiDataSmall$sales),
            seq(length.out = sims - 1,
                from = min(reiDataSmall$sales),
                to   = max(reiDataSmall$sales))),
  how_wrong = rep(NA, times = sims)
)
for (i in 1:sims) {
  simulation.results.median$how_wrong[i] <-
    sum(abs(simulation.results.median$guess[i] - reiDataSmall$sales))
}
The median will minimize the sum of absolute error:
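A quick check of that claim with toy numbers (the data here are my own, just for illustration): the median’s summed absolute error is never larger than the mean’s.

```r
x   <- c(1, 2, 3, 10, 50)            # assumed toy data, with one big outlier
sae <- function(g) sum(abs(g - x))   # sum of absolute error for a guess g

sae(median(x)) <= sae(mean(x))       # the median is at least as good
```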
If we did not know a shoe’s weight, what would be our best guess of sales?
Graphically, that “guess” looks like this:
But what if we have information about their weight?
Instead of one guess for everyone, we can make a guess based on weight. What is our new best guess?
sims <- 10000
simulation.results <- data.frame(
guess.intercept = runif(n = sims,
min = 400,
max = 500),
guess.slope = runif(n = sims,
min = 0,
max = 15),
how_wrong2 = rep(NA, times = sims)
)
for( i in 1:sims){
for( j in 1:nrow(reiDataSmall)){
reiDataSmall$how_wrong2[j] <- ((simulation.results$guess.intercept[i] + simulation.results$guess.slope[i] * reiDataSmall$weight[j]) - reiDataSmall$sales[j])^2
}
simulation.results$how_wrong2[i] <- sum(reiDataSmall$how_wrong2)
}
What is our new best guess?
guess.intercept guess.slope how_wrong2
848 400.0106 0.1137502 3047116
To get the exact numbers, we use lm() to find the coefficient for weight.
That’s how we end up with this line:
These best guesses are now different at each level of weight:
weight.lm <- lm(data = reiDataSmall,
formula = sales ~ weight)
reiDataSmall$bestguess.lm <- weight.lm$coef["(Intercept)"] +
weight.lm$coef["weight"] * reiDataSmall$weight
reiDataSmall$how_wrong2.lm <- ( reiDataSmall$bestguess.lm - reiDataSmall$sales)^2
reiDataSmall[, c('weight', 'bestguess.lm')]
weight bestguess.lm
1 3.160536 10.348593
2 4.970014 10.029385
3 6.759033 9.713786
4 3.068526 10.364824
5 5.254373 9.979221
6 4.769999 10.064669
7 3.831546 10.230221
8 3.805244 10.234861
9 3.172416 10.346497
10 2.901195 10.394343
11 2.461316 10.471942
12 2.737338 10.423249
13 2.982250 10.380044
14 5.910042 9.863555
15 5.154571 9.996827
16 6.090791 9.831670
17 5.153109 9.997085
18 4.059393 10.190026
19 5.565751 9.924291
20 2.473582 10.469778
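The by-hand bestguess.lm column is exactly what predict() computes for us. A sketch with toy stand-in data (the real reiDataSmall comes from Canvas):

```r
# Sketch: by-hand fitted values match predict() on the same data.
set.seed(2)
reiDataSmall <- data.frame(weight = runif(20, min = 2, max = 7))  # stand-in data
reiDataSmall$sales <- 11 - 0.2 * reiDataSmall$weight + rnorm(20, sd = 0.5)

weight.lm <- lm(sales ~ weight, data = reiDataSmall)

# Intercept plus slope times weight, row by row...
byhand <- weight.lm$coef["(Intercept)"] +
  weight.lm$coef["weight"] * reiDataSmall$weight

# ...is the same as asking the model for its fitted values:
all.equal(unname(byhand), unname(predict(weight.lm)))  # TRUE
```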
And what are the residuals if we take weight into account?
weight bestguess.lm how_wrong2.lm
1 3.160536 10.348593 0.0780539703
2 4.970014 10.029385 0.0104062769
3 6.759033 9.713786 0.7348715017
4 3.068526 10.364824 0.1749433569
5 5.254373 9.979221 1.3695209324
6 4.769999 10.064669 0.0306075101
7 3.831546 10.230221 0.0229505609
8 3.805244 10.234861 0.0112577147
9 3.172416 10.346497 0.1227925882
10 2.901195 10.394343 0.3235748427
11 2.461316 10.471942 0.0095101693
12 2.737338 10.423249 0.0003150111
13 2.982250 10.380044 0.6333775757
14 5.910042 9.863555 1.8295062522
15 5.154571 9.996827 0.0229828344
16 6.090791 9.831670 0.2257205210
17 5.153109 9.997085 0.0195782530
18 4.059393 10.190026 0.4574564464
19 5.565751 9.924291 0.1970749260
20 2.473582 10.469778 0.4654095167
Is that smaller than before?
Use summary() around lm() to find the coefficient and p-value for weight:
Call:
lm(formula = sales ~ weight, data = reiDataSmall)
Residuals:
Min 1Q Median 3Q Max
-0.85725 -0.45172 -0.04418 0.21882 1.35259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.9061 0.4642 23.496 5.87e-15 ***
weight -0.1764 0.1053 -1.676 0.111
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6119 on 18 degrees of freedom
Multiple R-squared: 0.135, Adjusted R-squared: 0.08694
F-statistic: 2.809 on 1 and 18 DF, p-value: 0.111
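You can also pull just the coefficient and p-value for weight out of that summary programmatically. A sketch with toy stand-in data (the real model uses reiDataSmall):

```r
# Sketch: extract the weight row of the coefficient table from summary().
set.seed(6)
toy <- data.frame(weight = runif(20, min = 2, max = 7))  # stand-in data
toy$sales <- 11 - 0.2 * toy$weight + rnorm(20, sd = 0.6)

fit <- summary(lm(sales ~ weight, data = toy))
fit$coefficients["weight", c("Estimate", "Pr(>|t|)")]  # coefficient and p-value
```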
Here p = 0.111, so in this small sample we cannot conclude there is a reliable relationship between weight and sales.
What (I think) you should take away from this lesson if nothing else: regression answers the question “given a value of x, what is my best guess about the value of y?”
We can also use regression as just an extension of a t-test.
Does cushion impact sales?
Plot!
What we are basically doing is comparing these plots:
And answering the question:
Is it worth knowing that we can separate the data into these two groups?
So let’s see!
Using the lm() function in R:
lm(
data = reiData,
formula = sales ~ cushion ) |> # The dv is sales, IV is "cushion"
summary() # Summarise it
Call:
lm(formula = sales ~ cushion, data = reiData)
Residuals:
Min 1Q Median 3Q Max
-3.3956 -0.4987 0.0878 0.6200 3.1414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.32820 0.04648 222.193 <2e-16 ***
cushionLow -0.69344 0.07466 -9.288 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9583 on 692 degrees of freedom
Multiple R-squared: 0.1108, Adjusted R-squared: 0.1096
F-statistic: 86.26 on 1 and 692 DF, p-value: < 2.2e-16
What do you notice about this result, compared to the one from the t-test?
Welch Two Sample t-test
data: reiData[reiData$cushion == "Low", "sales"] and reiData[reiData$cushion == "High", "sales"]
t = -8.9923, df = 509.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.8449394 -0.5419355
sample estimates:
mean of x mean of y
9.634764 10.328201
It’s the same thing!
Why is the result the same thing?
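It is the same because lm() with a single two-level predictor is algebraically the same test as a pooled-variance (Student’s) t-test; the t-test above used Welch’s correction, which is why the t values differ slightly. A toy check, with data I’ve made up for illustration:

```r
# Sketch: lm() on a two-level factor gives the same p-value as a
# pooled-variance (Student's) t-test.
set.seed(3)
toy <- data.frame(
  cushion = rep(c("High", "Low"), each = 30),
  sales   = c(rnorm(30, mean = 10.3), rnorm(30, mean = 9.6))
)

lm.p <- summary(lm(sales ~ cushion, data = toy))$coefficients["cushionLow", "Pr(>|t|)"]
tt.p <- t.test(sales ~ cushion, data = toy, var.equal = TRUE)$p.value  # Student, not Welch

all.equal(lm.p, tt.p)  # identical p-values
```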
Let’s extend this.
Read the entire data back in, which has the third group
I want to know: is cushion worth knowing for sales?
Null hypothesis: That there is no difference in sales
between cushion levels.
Call:
lm(formula = sales ~ cushion, data = reiData)
Residuals:
Min 1Q Median 3Q Max
-3.3956 -0.4759 0.0694 0.5737 3.1414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.32820 0.04280 241.313 < 2e-16 ***
cushionLow -0.69344 0.06875 -10.087 < 2e-16 ***
cushionMedium -0.28409 0.06485 -4.381 1.31e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8823 on 1019 degrees of freedom
Multiple R-squared: 0.09084, Adjusted R-squared: 0.08905
F-statistic: 50.91 on 2 and 1019 DF, p-value: < 2.2e-16
reiData |>
group_by(cushion) |>
summarise(sales = mean(sales, na.rm = T)) |>
ggplot(
aes(x = cushion,
y = sales)) +
geom_bar(stat = 'identity', fill = c('seagreen', 'goldenrod', 'seagreen3')) +
theme_bw()
Conclusion: There is likely a difference between conditions, as the likelihood that a difference this large arises by chance alone is < .001%.
To make the plot less gross, reorder the factor levels:
reiData$cushion <- factor(reiData$cushion, levels = c("Low", "Medium", "High"))
lm(
data = reiData,
sales ~ cushion) |>
summary()
Call:
lm(formula = sales ~ cushion, data = reiData)
Residuals:
Min 1Q Median 3Q Max
-3.3956 -0.4759 0.0694 0.5737 3.1414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.63476 0.05380 179.09 < 2e-16 ***
cushionMedium 0.40935 0.07258 5.64 2.2e-08 ***
cushionHigh 0.69344 0.06875 10.09 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8823 on 1019 degrees of freedom
Multiple R-squared: 0.09084, Adjusted R-squared: 0.08905
F-statistic: 50.91 on 2 and 1019 DF, p-value: < 2.2e-16
But what if I want to know whether sales differ when cushion is “High” specifically, versus everything else? We can make a dummy variable and use lm() to answer that.
Null hypothesis: That there is no difference in sales
between cushion high and the combination of low and medium.
reiData$cushionHigh <- ifelse(reiData$cushion == "High", 1, 0)
lm(
data = reiData,
sales ~ cushionHigh) |>
summary()
Call:
lm(formula = sales ~ cushionHigh, data = reiData)
Residuals:
Min 1Q Median 3Q Max
-3.6205 -0.4556 0.0827 0.5721 3.1414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.85967 0.03665 268.998 < 2e-16 ***
cushionHigh 0.46853 0.05684 8.243 5.12e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8956 on 1020 degrees of freedom
Multiple R-squared: 0.06246, Adjusted R-squared: 0.06154
F-statistic: 67.95 on 1 and 1020 DF, p-value: 5.116e-16
reiData |>
group_by(cushionHigh) |>
summarise(sales = mean(sales, na.rm = T)) |>
ggplot(
aes(x = cushionHigh,
y = sales)) +
geom_bar(stat = 'identity', fill = c('seagreen', 'goldenrod')) +
theme_bw()
Conclusion: There is likely a difference.
But what if I want to know whether the relationship between the three conditions is linear?
reiData$cushionLinear <- ifelse( reiData$cushion == "Low", -1,
ifelse(reiData$cushion == "Medium", 0, 1))
summary(lm(data = reiData, sales ~ cushionLinear))
Call:
lm(formula = sales ~ cushionLinear, data = reiData)
Residuals:
Min 1Q Median 3Q Max
-3.4211 -0.4745 0.0694 0.5697 3.1253
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.00230 0.02809 356.11 <2e-16 ***
cushionLinear 0.34204 0.03408 10.04 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8824 on 1020 degrees of freedom
Multiple R-squared: 0.08985, Adjusted R-squared: 0.08896
F-statistic: 100.7 on 1 and 1020 DF, p-value: < 2.2e-16
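The -1/0/1 coding turns “is the trend linear across Low, Medium, High?” into a single slope. A toy sketch of the same coding (the data here are assumed):

```r
# Sketch: a -1/0/1 linear contrast estimates the per-step change in sales.
set.seed(7)
toy <- data.frame(cushion = sample(c("Low", "Medium", "High"), 120, replace = TRUE))
toy$cushionLinear <- ifelse(toy$cushion == "Low", -1,
                            ifelse(toy$cushion == "Medium", 0, 1))
toy$sales <- 10 + 0.3 * toy$cushionLinear + rnorm(120, sd = 0.9)

coef(lm(sales ~ cushionLinear, data = toy))["cushionLinear"]  # estimated per-step change
```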
Controlling for variables
Sometimes we think that some… other variable predicts our outcome, and we want to take its effect away.
Say we care about cushion, but we know that cushion affects weight. To control for weight, we can add (+) it to our model:
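A sketch of the call that produces output like the one below; toy stand-in data here, where with the real data you would use reiData:

```r
# Sketch: "+" in the formula adds weight as a control variable.
set.seed(4)
toy <- data.frame(
  cushion = factor(sample(c("Low", "Medium", "High"), 200, replace = TRUE),
                   levels = c("Low", "Medium", "High")),
  weight  = runif(200, min = 2, max = 7)
)
toy$sales <- 9 + 0.4 * toy$weight + rnorm(200, sd = 0.8)

fit <- lm(sales ~ cushion + weight, data = toy)  # cushion plus the control
summary(fit)
```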
Call:
lm(formula = sales ~ cushion + weight, data = reiData)
Residuals:
Min 1Q Median 3Q Max
-3.1702 -0.4305 0.0788 0.5436 3.3486
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.63808 0.11750 73.516 <2e-16 ***
cushionMedium 0.06406 0.07865 0.815 0.416
cushionHigh -0.13318 0.10961 -1.215 0.225
weight 0.41007 0.04343 9.442 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8465 on 1018 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.1616
F-statistic: 66.59 on 3 and 1018 DF, p-value: < 2.2e-16
Rules for controlling for variables:
First: check the effect of the control on the DV alone.
Call:
lm(formula = sales ~ weight, data = reiData)
Residuals:
Min 1Q Median 3Q Max
-3.1616 -0.4220 0.0828 0.5490 3.2713
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.78927 0.09499 92.53 <2e-16 ***
weight 0.35751 0.02577 13.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8484 on 1020 degrees of freedom
Multiple R-squared: 0.1587, Adjusted R-squared: 0.1579
F-statistic: 192.4 on 1 and 1020 DF, p-value: < 2.2e-16
If the control does not predict the DV, it can’t affect our result.
Rules for controlling for variables:
Second: check the effect of the IV on the control.
Call:
lm(formula = weight ~ cushion, data = reiData)
Residuals:
Min 1Q Median 3Q Max
-2.0509 -0.3196 -0.0154 0.2716 3.0308
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.43052 0.03723 65.29 <2e-16 ***
cushionMedium 0.84201 0.05023 16.77 <2e-16 ***
cushionHigh 2.01580 0.04757 42.37 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6106 on 1019 degrees of freedom
Multiple R-squared: 0.6494, Adjusted R-squared: 0.6487
F-statistic: 943.7 on 2 and 1019 DF, p-value: < 2.2e-16
If the control predicts the DV AND the IV predicts the control:
We say the “IV is confounded with the control variable.”
Controlling can change our conclusion from “IV leads to DV” to “the control leads to DV, and that is why it looks like IV leads to DV.”
If the control predicts the DV BUT the IV does NOT predict the control:
Controlling can make the effect of the IV appear more reliable, because there is less unexplained variance in the DV that our IV has to explain.
To know which situation you are in, ask: does the control predict our DV? Does our IV also predict our control?
Are these scatterplots showing strong relationships?
Two ways to know with numbers:
lm() in R, and cor() in R.
Specifically, cor() computes a Pearson correlation coefficient.
For more than two variables, you can compute the correlations between all pairs x, y at once as a correlation matrix.
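For example, calling cor() on a whole data frame of numeric columns gives that matrix directly (the data here are made up for illustration):

```r
# Sketch: cor() on a data frame returns all pairwise Pearson correlations.
set.seed(8)
df <- data.frame(x = rnorm(50))
df$y <- df$x + rnorm(50, sd = 0.5)  # y related to x
df$z <- rnorm(50)                   # z unrelated to both

round(cor(df), 2)  # the full correlation matrix at once
```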
Problem with correlation: we can’t make predictions with correlations.
Correlation tells us “y increases as x increases” generally.
Regression tells us “y increases this much as x increases this much.”
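The two are tightly linked: the lm() slope is the correlation rescaled by the two standard deviations, which is exactly what turns “increases” into “increases this much.” A toy check (data assumed):

```r
# Sketch: the regression slope equals cor(x, y) * sd(y) / sd(x).
set.seed(9)
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.3)

r <- cor(x, y)              # unitless: "y increases as x increases"
b <- coef(lm(y ~ x))["x"]   # units of y per unit of x: "this much"

all.equal(unname(b), r * sd(y) / sd(x))  # TRUE
```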